-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement basic version of string to float/double/decimal #870
Conversation
@@ -142,6 +142,8 @@ pub struct Cast { | |||
/// When cast from/to timezone related types, we need timezone, which will be resolved with | |||
/// session local timezone by an analyzer in Spark. | |||
pub timezone: String, | |||
|
|||
pub allow_incompat: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets have a comment on this field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
@@ -555,17 +555,51 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { | |||
castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.FloatType) | |||
} | |||
|
|||
test("cast StringType to FloatType (partial support)") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can have another test showing what is not supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do have failing tests already that are ignored:
ignore("cast StringType to FloatType") {
// https://github.com/apache/datafusion-comet/issues/326
castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.FloatType)
}
ignore("cast StringType to DoubleType") {
// https://github.com/apache/datafusion-comet/issues/326
castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.DoubleType)
}
ignore("cast StringType to DecimalType(10,2)") {
// https://github.com/apache/datafusion-comet/issues/325
val values = gen.generateStrings(dataSize, numericPattern, 8).toDF("a")
castTest(values, DataTypes.createDecimalType(10, 2))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it should be good, considering DF already supports such casts. However I'm thinking if we should have a new struct field allow_noncompat
or introduce new Spark Comet property? Spark Comet property probably is easier to drop when we dont need this logic anymore
@comphead The new stuct field is populated via the config |
@@ -711,7 +728,11 @@ fn cast_array( | |||
|
|||
/// Determines if DataFusion supports the given cast in a way that is | |||
/// compatible with Spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Shall we update the documentation comment to reflect the addition of the allow_incompat
parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_incompat
is an internal API so I don't think we need to add antyhing to the user guide. We do already have documentation for the spark.comet.cast.allowIncompatible
config, which is used to populate allow_incompat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the PR @andygrove
Thanks for the reviews @comphead @viirya @huaxingao |
Which issue does this PR close?
Related to #326 and #325
Rationale for this change
When running aggregate queries against schemas where numbers are in string format (csv files, parquet files generated from csv files, etc), Spark will add casts from string to floating point or decimal. We fall back to Spark because we did not have these casts implemented.
This is frustrating for casual users who just want to try Comet out on some text inputs.
What changes are included in this PR?
This PR adds a basic implementation of these casts by delegating to DataFusion. These are not compatible with Spark and are documented as such and are only enabled if
spark.comet.cast.allowIncompatible
is enabled.I recommend viewing the diff without whitespace changes (https://github.com/apache/datafusion-comet/pull/870/files?diff=split&w=1)
How are these changes tested?
New tests